Introduction to Open Data Science. Happy to learn to connect Git and R even though the content of data analysis is familiar to me. I’ve used R MarkDown before and found it useful. In addition I’ve used Git before on Tietokantojen Perusteet course. However, it’s been a long time so recalling things is needed. Here’s the link to my IODS GitHub repository.
Today is the following day.
## [1] "Mon Nov 30 17:47:16 2020"
I wanted to remind me what I’ve done last time with R MarkDown. I found a nice data on exchange and forward rates. I make a table and a graph of Sterling/EUR exchange rate. I uploaded forward2.dat to git repository and I call my data set from there in order to plot the rate. I hide the R code to make the outcome more readable.
## Warning: `as_data_frame()` is deprecated as of tibble 2.0.0.
## Please use `as_tibble()` instead.
## The signature and semantics have changed, see `?as_tibble`.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.
| EXUSBP | EXUSEUR | EXEURBP | F1USBP | F1USEUR | F1EURBP | F3USBP | F3USEUR | F3EURBP | |
|---|---|---|---|---|---|---|---|---|---|
| Min. :1.073 | Min. :0.5827 | Min. :0.3567 | Min. :1.067 | Min. :0.5873 | Min. :0.3588 | Min. :1.061 | Min. :0.5941 | Min. :0.3624 | |
| 1st Qu.:1.507 | 1st Qu.:0.8876 | 1st Qu.:0.4855 | 1st Qu.:1.504 | 1st Qu.:0.8885 | 1st Qu.:0.4845 | 1st Qu.:1.501 | 1st Qu.:0.8926 | 1st Qu.:0.4841 | |
| Median :1.617 | Median :1.0738 | Median :0.5598 | Median :1.616 | Median :1.0774 | Median :0.5589 | Median :1.609 | Median :1.0855 | Median :0.5588 | |
| Mean :1.665 | Mean :1.0416 | Mean :0.6213 | Mean :1.663 | Mean :1.0447 | Mean :0.6203 | Mean :1.658 | Mean :1.0503 | Mean :0.6184 | |
| 3rd Qu.:1.756 | 3rd Qu.:1.1741 | 3rd Qu.:0.7107 | 3rd Qu.:1.753 | 3rd Qu.:1.1778 | 3rd Qu.:0.7091 | 3rd Qu.:1.748 | 3rd Qu.:1.1819 | 3rd Qu.:0.7051 | |
| Max. :2.443 | Max. :1.4222 | Max. :1.6002 | Max. :2.441 | Max. :1.4240 | Max. :1.5954 | Max. :2.433 | Max. :1.4278 | Max. :1.5863 |
I’m happy to see that R MarkDown is linked to LaTeX syntax! Here’s a maximization problem from my current research on the optimal mechanism design with enforcement: \[\begin{align} \max_{r(\cdot),t(\cdot),m(\cdot)} \mathbb{E} \left[ t(\theta) - K m(\theta) \right]\label{OBJ} \tag{MAX} \end{align}\] subject to \[\begin{align} &t(\theta) = \theta r(\theta) - V(\underline{\theta}) - \int_{\underline{\theta}}^{\theta} \mathcal{I}(s|s)ds \label{TAX} \tag{TAX} \\ &\mathcal{I}(\theta|\theta) \qquad \text{is nondecreasing} \label{IC} \tag{IC} \\ &\mathcal{I}(\theta|\theta) \geq 0 \label{IR} \tag{IR} \end{align}\] for all \(\theta \in \Theta\) with \(\mathcal{I}(\theta|\theta):=r(\theta) - m(\theta) \varphi\).
Today is
## [1] "Mon Nov 30 17:47:17 2020"
In this week we learn how to create data and analyse it. We read the data from the given URL. The meta descriptions can be found from here.
The dataset queries the relationship between learning approaches and students’ achievements in an introductory statistics course in Finland. The original data have 183 responders (observations) for 97 questions (variables). For our purposes, we subset the dataset to contain variables gender, age, attitude, deep, stra, surf and points. The variables are either integers or numerics. The meta-file gives the following description for each variable:
Next we omit the observations where the exam points variable is zero. After that we scale variables by the mean – that is, we divide each combination variable by its number of questions asked. So we have 166 observations for 7 variables left.
Let us next read the data we created earlier in the data wranling part:
# Read the data
learning2014 <- read.csv("/Users/teemupekkarinen/IODS-project/data/learning2014",row.names = 1)
To summarize our dataset’s content we run
# Structure
str(learning2014)
## 'data.frame': 166 obs. of 7 variables:
## $ Gender : int 2 1 2 1 1 2 1 2 1 2 ...
## $ Age : int 53 55 49 53 49 38 50 37 37 42 ...
## $ Attitude: num 3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
## $ Deep : num 3.58 2.92 3.5 3.5 3.67 ...
## $ Stra : num 3.38 2.75 3.62 3.12 3.62 ...
## $ Surf : num 2.58 3.17 2.25 2.25 2.83 ...
## $ Points : int 25 12 24 10 22 21 21 31 24 26 ...
# Dimension
dim(learning2014)
## [1] 166 7
That is, we have the information about gender, age, attitude, and three different learning approaches (deep, strategic, and surface) of 166 responders. All of the variables are numeric and the combination variables (attitude, deep, stra, and surf) are means. For more detailed description we run:
# Summary
summary(learning2014)
## Gender Age Attitude Deep
## Min. :1.000 Min. :17.00 Min. :1.400 Min. :1.583
## 1st Qu.:1.000 1st Qu.:21.00 1st Qu.:2.600 1st Qu.:3.333
## Median :2.000 Median :22.00 Median :3.200 Median :3.667
## Mean :1.663 Mean :25.51 Mean :3.143 Mean :3.680
## 3rd Qu.:2.000 3rd Qu.:27.00 3rd Qu.:3.700 3rd Qu.:4.083
## Max. :2.000 Max. :55.00 Max. :5.000 Max. :4.917
## Stra Surf Points
## Min. :1.250 Min. :1.583 Min. : 7.00
## 1st Qu.:2.625 1st Qu.:2.417 1st Qu.:19.00
## Median :3.188 Median :2.833 Median :23.00
## Mean :3.121 Mean :2.787 Mean :22.72
## 3rd Qu.:3.625 3rd Qu.:3.167 3rd Qu.:27.75
## Max. :5.000 Max. :4.333 Max. :33.00
We observe that there are almost twice more female responders than male responders: 56 men and 110 women. The average age of the responders is 26 and the youngest person is 17 years old and the oldest 55 years old. The age distribution is wide but skewed towards young people. As summarize we can say than a typical responder is a young female.
# Histrograms
Gender <- learning2014$Gender
hist(Gender,breaks=2)
sum(Gender==1)
## [1] 56
Age <- learning2014$Age
hist(Age)
mean(Age)
## [1] 25.51205
var(Age)
## [1] 60.31198
The average exam points is 28 with maximum 33 and minimum 7. The exam results are more or less normally distributed among all participants.
# Histrograms
Points <- learning2014$Points
hist(Points)
However, when we compare the points between men and female, we observe that the mean of exam points for men is slightly greater than for women. There are even more men with score 30 or 31 than women.
# Package
library(ggplot2)
# Histrograms
PointsMale <- learning2014$Points[Gender==1]
mean(PointsMale)
## [1] 23.48214
PointsFemale <- learning2014$Points[Gender==2]
mean(PointsFemale)
## [1] 22.32727
data <- data.frame(
type = c( rep("Male", 56), rep("Female", 110) ),
value = c(PointsMale, PointsFemale)
)
ggplot(data, aes(x=value, fill=type)) +
geom_histogram(binwidth=.5, alpha=.5, position="identity")
The mean of attitude, 3.14, is almost neutral 3. That is, on average responders have a bit positive attitude towards statistics. Also attitude is nearly normally distributed. Male responders have more positive attitude than female.
# Histrograms
Attitude <- learning2014$Attitude
hist(Attitude)
AttitudeMale <- learning2014$Attitude[Gender==1]
mean(PointsMale)
## [1] 23.48214
AttitudeFemale <- learning2014$Attitude[Gender==2]
mean(PointsFemale)
## [1] 22.32727
data <- data.frame(
type = c( rep("Male", 56), rep("Female", 110) ),
value = c(AttitudeMale, AttitudeFemale)
)
ggplot(data, aes(x=value, fill=type)) +
geom_histogram(binwidth=.5, alpha=.5, position="identity")
The minority prefers surface learning approach to learning in the statistics course, whereas the majority prefers the methods of deep learning. Moreover, the greatest variation in learning methods is in strategic learning. Deep and surface learning is almost identically distributed expect deep learning has a greater mean.
# Package
library(ggplot2)
# Histrograms
Deep <- learning2014$Deep
Stra <- learning2014$Stra
Surf <- learning2014$Surf
data <- data.frame(
type = c( rep("Deep", 166), rep("Stra", 166), rep("Surf", 166) ),
value = c(Deep, Stra, Surf)
)
ggplot(data, aes(x=value, fill=type)) +
geom_histogram(binwidth=.5, alpha=.5, position="identity")
There is no much difference between genders in learning methods; the distributions between male and female responders are almost indentically distributed.
# Histrograms
# Strategic Learning
StraMale <- learning2014$Stra[Gender==1]
StraFemale <- learning2014$Stra[Gender==2]
data <- data.frame(
type = c( rep("Male", 56), rep("Female", 110) ),
value = c(StraMale, StraFemale)
)
ggplot(data, aes(x=value, fill=type)) +
geom_histogram(binwidth=.5, alpha=.5, position="identity")
# Deep Learning
DeepMale <- learning2014$Deep[Gender==1]
DeepFemale <- learning2014$Deep[Gender==2]
data <- data.frame(
type = c( rep("Male", 56), rep("Female", 110) ),
value = c(DeepMale, DeepFemale)
)
ggplot(data, aes(x=value, fill=type)) +
geom_histogram(binwidth=.5, alpha=.5, position="identity")
# Surface Learning
SurfMale <- learning2014$Surf[Gender==1]
SurfFemale <- learning2014$Surf[Gender==2]
data <- data.frame(
type = c( rep("Male", 56), rep("Female", 110) ),
value = c(SurfMale, SurfFemale)
)
ggplot(data, aes(x=value, fill=type)) +
geom_histogram(binwidth=.5, alpha=.5, position="identity")
Next we run regressions where exam points is our dependent variable. Let us first look at the correlations between exam points and the explanatory variables.
plot(Attitude, Points, main = "Regression",
xlab = "Attitude", ylab = "Points") + abline(lm(Points~Attitude), col = "blue")
## integer(0)
plot(Gender, Points, main = "Regression",
xlab = "Gender", ylab = "Points") + abline(lm(Points~Gender), col = "blue")
## integer(0)
plot(Age, Points, main = "Regression",
xlab = "Age", ylab = "Points") + abline(lm(Points~Age), col = "blue")
## integer(0)
plot(Deep, Points, main = "Regression",
xlab = "Deep Learning", ylab = "Points") + abline(lm(Points~Deep), col = "blue")
## integer(0)
plot(Stra, Points, main = "Regression",
xlab = "Strategic Learning", ylab = "Points") + abline(lm(Points~Stra), col = "blue")
## integer(0)
plot(Surf, Points, main = "Regression",
xlab = "Surface Learning", ylab = "Points") + abline(lm(Points~Surf), col = "blue")
## integer(0)
cov(learning2014)
## Gender Age Attitude Deep Stra Surf
## Gender 0.22489960 -0.4383352 -0.10184739 -0.01526713 0.05326762 0.02826457
## Age -0.43833516 60.3119752 0.12584520 0.10792260 0.61406079 -0.58075332
## Attitude -0.10184739 0.1258452 0.53276561 0.04458987 0.03478322 -0.06776013
## Deep -0.01526713 0.1079226 0.04458987 0.30706766 0.04127419 -0.09489017
## Stra 0.05326762 0.6140608 0.03478322 0.04127419 0.59572437 -0.06570525
## Surf 0.02826457 -0.5807533 -0.06776013 -0.09489017 -0.06570525 0.27967222
## Points -0.25972983 -4.2662651 1.87824388 -0.03315078 0.66483662 -0.45002434
## Points
## Gender -0.25972983
## Age -4.26626506
## Attitude 1.87824388
## Deep -0.03315078
## Stra 0.66483662
## Surf -0.45002434
## Points 34.74965316
From the single regressions and the covariance matrix we observe that there is negative correlation between points and gender, age, deep learning and surface learning. That is, what we already observed, men were doing better in the course than women. Moreover, it seems to be so that younger students got better exam results than the older participants. Strategic learning was the only method that has positive correlation between exam results. There is a big positive correaltion between attitude and points.
Next we choose first gender, attitude and age to regress exam points.
reg <- lm(Points~Gender+Attitude+Age)
summary(reg)
##
## Call:
## lm(formula = Points ~ Gender + Attitude + Age)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.4590 -3.3221 0.2186 4.0247 10.4632
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 12.76802 3.17695 4.019 8.93e-05 ***
## Gender 0.33054 0.91934 0.360 0.720
## Attitude 3.60657 0.59322 6.080 8.34e-09 ***
## Age -0.07586 0.05367 -1.414 0.159
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.315 on 162 degrees of freedom
## Multiple R-squared: 0.2018, Adjusted R-squared: 0.187
## F-statistic: 13.65 on 3 and 162 DF, p-value: 5.536e-08
From here we observe that attitude is the only significant explanatory variable by itself. However, by the F-test all three variables are together significant explanatories.
Nevertheless, following the instructions we drop off gender variable and run the regression again.
# Regression
reg <- lm(Points~Attitude+Age)
summary(reg)
##
## Call:
## lm(formula = Points ~ Attitude + Age)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.3354 -3.3095 0.2625 4.0005 10.4911
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.57244 2.24943 6.034 1.04e-08 ***
## Attitude 3.54392 0.56553 6.267 3.17e-09 ***
## Age -0.07813 0.05315 -1.470 0.144
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.301 on 163 degrees of freedom
## Multiple R-squared: 0.2011, Adjusted R-squared: 0.1913
## F-statistic: 20.52 on 2 and 163 DF, p-value: 1.125e-08
Attitude remains to be the strongest predictor. Age is still unsignifant even though together with attitude it is significant (F-test). It turns out, after trying all possible combinations of explanatory variables, only regressing with attitude we get 5 % significant level of all regressors. We thus lastly regress only with attitude and get the following.
reg <- lm(Points~Attitude)
summary(reg)
##
## Call:
## lm(formula = Points ~ Attitude)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.9763 -3.2119 0.4339 4.1534 10.6645
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.6372 1.8303 6.358 1.95e-09 ***
## Attitude 3.5255 0.5674 6.214 4.12e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared: 0.1906, Adjusted R-squared: 0.1856
## F-statistic: 38.61 on 1 and 164 DF, p-value: 4.119e-09
From the summary we observe that the intercept term is positive and significant: 11.63 with standard error of 1.83. This is the test score if attitude was “zero”. Each unit jump in attitution increases (not causal) exam results by 3.5 points. Hence, with maximum attitude our model predicts that the test result is \(11.63 + 5 * 3.5 = 29\) points.
We observe also that R-squared decreases a little everytime we omit an explanatory variable from the regression. R-squared is the measure that represents the proportion of the variance for the dependent variable that is explained by explanatory variables. The adjusted R-squared is a modified version of R-squared that has been adjusted for the number of predictors in the model.
Lastly we plot the regression diagnostic plots: Residuals vs Fitted values, Normal QQ-plot and Residuals vs Leverage.
plot(reg)
Residuals vs Fitted values figure is a simple scatterplot between residuals and predicted values. It should look more or less random, which is roughly the case in our analysis. The red line is not exactly zero all the time, which is a sign of that the residuals may have a little positive predictive power (omitted variable bias).
The Normal QQ plot helps us to assess whether the residuals are roughly normally distributed. In our cases the tails scarper from the diagonal which may imply that we have some problems with assumption of normally distributed error terms. Probably some t-distribution could be better.
The last plot for the residuals vs leverage (Cook’s distance) tells us which points have the greatest influence on the regression (leverage points). It turns out that points 35, 56, and 71 have the strongest effects on the dependent variable.
For the causal inference we need to have very strong assumptions.
Before going to the assumptions, we introduce the vector and matrix notation. Define K-dimensional (column) vectors \(\boldsymbol{x}_i\) and \(\boldsymbol{\beta}\) as
\[ \underset{(K \times 1)}{\boldsymbol{x}_i} = \begin{pmatrix} x_{i1} \\ x_{i2} \\ \vdots \\ x_{iK} \end{pmatrix}, \underset{(K \times 1)}{\boldsymbol{\beta}} = \begin{pmatrix} \beta_{1} \\ \beta_{2} \\ \vdots \\ \beta_{K} \end{pmatrix}.\] Also define \[ \underset{(n \times 1)}{\boldsymbol{y}} = \begin{pmatrix} y_{1} \\ \vdots \\ y_{n} \end{pmatrix}, \underset{(n \times 1)}{\boldsymbol{\varepsilon}} = \begin{pmatrix} \varepsilon_{1} \\ \vdots \\ \varepsilon_{n} \end{pmatrix}, \underset{(n \times K)}{\boldsymbol{X}} = \begin{pmatrix} \boldsymbol{x}'_{1} \\ \vdots \\ \boldsymbol{x}'_{n} \end{pmatrix} = \begin{pmatrix} x_{11} & \dots & x_{1K} \\ \vdots & \ddots & \vdots \\ x_{n1} & \dots & x_{nK} \end{pmatrix}. \] Scalar quantities will be given in normal font \(x\), vector quantities in bold lowercase \(\boldsymbol{x}\), and all vectors will be presumed to be column vectors. Matrix quantities will be in bold uppercase \(\boldsymbol{X}\). The transpose of a matrix is denoted by either \(\boldsymbol{X}'\) or \(\boldsymbol{X}^T\).
Using the matrix notation, the assumptions for the causal inference are the following
Assumption 1.1 (linearity): \[\underset{(n \times 1)}{\boldsymbol{y}} = \underset{\underbrace{(n \times K)(K \times 1)}_{(n \times 1)}}{\boldsymbol{X} \: \: \: \boldsymbol{\beta}} + \underset{(n \times 1)}{\boldsymbol{\varepsilon}}.\]
Assumption 1.2 (strict exogeneity): \[\mathbb{E}(\varepsilon_{i}|\boldsymbol{X}) = 0 \: \: \: (i = 1, 2, \dots, n).\]
Assumption 1.3 (no multicollinearity): The rank of the \(n \times K\) data matrix, \(\boldsymbol{X}\), is \(K\) with probability 1.
Assumption 1.4 (spherical error variance): \[(\text{homoskedasticity}) \: \: \mathbb{E}(\varepsilon_{i}^2|\boldsymbol{X}) = \sigma^2 > 0 \: \: \: (i = 1, 2, \dots, n),\] \[(\text{no correlation between observations}) \: \: \: \mathbb{E}(\varepsilon_{i}\varepsilon_{j}|\boldsymbol{X}) = 0 \: \: \: (i, j = 1, 2, \dots, n; i \neq j).\]
Even though our regression diagnostic plots are promising, I would not make any kind of causal interpretations from our model.
Today is
## [1] "Mon Nov 30 17:47:20 2020"
This week we use the Boston data from the MASS package. The data have 14 variables and 506 observartions for each variable. The variables are either numerical or integers.
# Package
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
data(Boston)
str(Boston)
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
dim(Boston)
## [1] 506 14
summary(Boston)
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08205 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
Namely, the variables are
The correlation matrix can be illustrated by using corrplot package as follows
library(corrplot)
## corrplot 0.84 loaded
corr_matrix<-cor(Boston)
corrplot(corr_matrix)
That is, there is a a significant positive correlation between crim, rad, tax and lsat. On the other hand, the weighted mean of distances to five Boston employment centres, dis, has negative correlation with indus, nox, and age.
However, since the crime rate seems to be the variable in interest, let us illustrate it by some graphics.
require(ggplot2)
require(plotly)
## Loading required package: plotly
##
## Attaching package: 'plotly'
## The following object is masked from 'package:MASS':
##
## select
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
plot_ly(data = Boston, x = ~crim, type = "histogram")
## Warning: `arrange_()` is deprecated as of dplyr 0.7.0.
## Please use `arrange()` instead.
## See vignette('programming') for more help
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.
plot_ly(data = Boston, x = ~rad, y = ~crim)
## No trace type specified:
## Based on info supplied, a 'scatter' trace seems appropriate.
## Read more about this trace type -> https://plot.ly/r/reference/#scatter
## No scatter mode specifed:
## Setting the mode to markers
## Read more about this attribute -> https://plot.ly/r/reference/#scatter-mode
plot_ly(data = Boston, x = ~tax, y = ~crim)
## No trace type specified:
## Based on info supplied, a 'scatter' trace seems appropriate.
## Read more about this trace type -> https://plot.ly/r/reference/#scatter
## No scatter mode specifed:
## Setting the mode to markers
## Read more about this attribute -> https://plot.ly/r/reference/#scatter-mode
plot_ly(data = Boston, x = ~lstat, y = ~crim)
## No trace type specified:
## Based on info supplied, a 'scatter' trace seems appropriate.
## Read more about this trace type -> https://plot.ly/r/reference/#scatter
## No scatter mode specifed:
## Setting the mode to markers
## Read more about this attribute -> https://plot.ly/r/reference/#scatter-mode
Next we standardize the dataset and print out summaries of the scaled data.
scaled.Boston <- data.frame(scale(Boston))
Now all variables have mean of \(0\) and variance \(1\). For instance, now the plot of crim and lstat is the following.
plot_ly(data = scaled.Boston, x = ~lstat, y = ~crim)
## No trace type specified:
## Based on info supplied, a 'scatter' trace seems appropriate.
## Read more about this trace type -> https://plot.ly/r/reference/#scatter
## No scatter mode specifed:
## Setting the mode to markers
## Read more about this attribute -> https://plot.ly/r/reference/#scatter-mode
Next we divide crim in 5 categories based on the 20/%-quantiles.
library(gtools)
q.crim <- quantcut(scaled.Boston$crim,q = 5)
summary(q.crim)
## [-0.419,-0.413] (-0.413,-0.403] (-0.403,-0.356] (-0.356,0.229] (0.229,9.92]
## 102 101 101 101 101
scaled.Boston$crim <- q.crim
We divide the dataset into train and test sets, so that 80/% of the data belongs to the train set.
sample <- sample.int(n = nrow(scaled.Boston), size = floor(.8*nrow(scaled.Boston)), replace = F)
train <- scaled.Boston[sample, ]
test <- scaled.Boston[-sample, ]
The linear discriminant analysis:
# MASS and train are available
# linear discriminant analysis
lda.fit <- lda(crim~., data = train)
# print the lda.fit object
lda.fit
## Call:
## lda(crim ~ ., data = train)
##
## Prior probabilities of groups:
## [-0.419,-0.413] (-0.413,-0.403] (-0.403,-0.356] (-0.356,0.229] (0.229,9.92]
## 0.1955446 0.2227723 0.2054455 0.1757426 0.2004950
##
## Group means:
## zn indus chas nox rm
## [-0.419,-0.413] 1.05905047 -0.9509521 -0.12281893 -0.9133387 0.39416615
## (-0.413,-0.403] 0.03371702 -0.4056968 -0.09734683 -0.6446027 0.12856544
## (-0.403,-0.356] -0.27982850 -0.2095727 0.05971553 -0.2802800 -0.02671002
## (-0.356,0.229] -0.43892792 0.5362692 0.28219209 0.7180533 -0.12342269
## (0.229,9.92] -0.48724019 1.0149946 -0.07790436 0.9961228 -0.42930121
## age dis rad tax ptratio
## [-0.419,-0.413] -0.91773265 0.96904395 -0.6998426 -0.7770316 -0.33261371
## (-0.413,-0.403] -0.47783307 0.34605300 -0.5748034 -0.5516981 -0.24018066
## (-0.403,-0.356] -0.02436136 0.07623766 -0.4934268 -0.4726487 -0.00617333
## (-0.356,0.229] 0.59001312 -0.48434385 0.1002759 0.2183798 -0.06078210
## (0.229,9.92] 0.85108940 -0.88894827 1.6596029 1.5294129 0.80577843
## black lstat medv
## [-0.419,-0.413] 0.38516059 -0.78033641 0.44016098
## (-0.413,-0.403] 0.35652695 -0.41094994 0.26898253
## (-0.403,-0.356] 0.29220232 0.01910848 0.08891887
## (-0.356,0.229] -0.09784365 0.04189717 -0.01107092
## (0.229,9.92] -0.77842689 0.98718228 -0.71272745
##
## Coefficients of linear discriminants:
## LD1 LD2 LD3 LD4
## zn -0.02521669 8.393331e-01 0.88898002 -0.2974854
## indus 0.09822684 -2.416469e-01 0.00770177 0.6113110
## chas -0.04151466 -9.653895e-02 0.03577411 -0.1548408
## nox 0.36675222 -9.529747e-01 1.50359781 -0.8470891
## rm 0.06169627 2.247672e-01 -0.27364067 -0.2364587
## age 0.26090519 -2.815501e-01 0.32650439 -0.1431015
## dis -0.06768049 -5.881875e-01 0.46977261 -0.3946500
## rad 1.81215182 1.140856e+00 -0.09524445 0.2362642
## tax 0.04459275 -2.865921e-01 -0.29220737 0.5052874
## ptratio -0.01289009 6.872258e-02 0.19319037 -0.8423524
## black -0.14105394 -4.813002e-05 -0.17544577 -0.0688977
## lstat 0.27775103 1.183123e-01 -0.95373823 -1.1834087
## medv 0.02949218 -4.423697e-01 -0.04594082 -0.5299873
##
## Proportion of trace:
## LD1 LD2 LD3 LD4
## 0.8841 0.0733 0.0363 0.0064
# the function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
heads <- coef(x)
arrows(x0 = 0, y0 = 0,
x1 = myscale * heads[,choices[1]],
y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
text(myscale * heads[,choices], labels = row.names(heads),
cex = tex, col=color, pos=3)
}
# target classes as numeric
classes <- as.numeric(train$crim)
# plot the lda results
plot(lda.fit, dimen = 2)
lda.arrows(lda.fit, myscale = 1)
Next we cross tabulate the results with the crime categories from the test set.
# lda.fit, correct_classes and test are available
correct_classes <- test$crim
# predict classes with test data
lda.pred <- predict(lda.fit, newdata = test)
# cross tabulate the results
table(correct = correct_classes, predicted = lda.pred$class)
## predicted
## correct [-0.419,-0.413] (-0.413,-0.403] (-0.403,-0.356]
## [-0.419,-0.413] 14 6 3
## (-0.413,-0.403] 3 7 1
## (-0.403,-0.356] 0 6 12
## (-0.356,0.229] 0 1 6
## (0.229,9.92] 0 0 0
## predicted
## correct (-0.356,0.229] (0.229,9.92]
## [-0.419,-0.413] 0 0
## (-0.413,-0.403] 0 0
## (-0.403,-0.356] 0 0
## (-0.356,0.229] 13 10
## (0.229,9.92] 0 20
Let us next reload the Boston dataset and standardize it. Then we calculate the distances between the observations.
# load MASS and Boston
library(MASS)
data('Boston')
# standardization
Boston <- scale(Boston)
# euclidean distance matrix
dist_eu <- dist(Boston)
# look at the summary of the distances
summary(dist_eu)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1343 3.4625 4.8241 4.9111 6.1863 14.3970
# manhattan distance matrix
dist_man <- dist(Boston, method = "manhattan")
# look at the summary of the distances
summary(dist_man)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.2662 8.4832 12.6090 13.5488 17.7568 48.8618
One way to determine the number of clusters is to look at how the total of within cluster sum of squares (WCSS) behaves when the number of cluster changes. The optimal number of clusters is when the value of total WCSS changes radically. In this case, two clusters would seem optimal.
# MASS, ggplot2 and Boston dataset are available
set.seed(123)
# determine the number of clusters
k_max <- 10
# calculate the total within sum of squares
twcss <- sapply(1:k_max, function(k){kmeans(Boston, k)$tot.withinss})
# visualize the results
qplot(x = 1:k_max, y = twcss, geom = 'line')
# k-means clustering
km <-kmeans(Boston, centers = 2)
# plot the Boston dataset with clusters
pairs(Boston, col = km$cluster)
Today is
## [1] "Mon Nov 30 17:47:28 2020"
This week we use the ‘human’ dataset originates from the United Nations Development Programme. Graphical overview of the data and show summaries of the variables in the data is the following. We have 155 observations and 8 variables.
# Read the data
human <- read.csv("/Users/teemupekkarinen/IODS-project/data/human",row.names = 1)
str(human)
## 'data.frame': 155 obs. of 8 variables:
## $ Edu2.FM : num 0.198 0.931 0.861 0.977 0.989 ...
## $ Labo.FM : num 0.199 0.685 0.211 0.633 0.747 ...
## $ Edu.Exp : num 9.3 11.8 14 17.9 12.3 20.2 15.7 11.9 12.6 14.4 ...
## $ Life.Exp : num 60.4 77.8 74.8 76.3 74.7 82.4 81.4 70.8 75.4 76.6 ...
## $ GNI : int 1885 9943 13054 22050 8124 42261 43869 16428 21336 38599 ...
## $ Mat.Mor : int 400 21 89 69 29 6 4 26 37 22 ...
## $ Ado.Birth: num 86.8 15.3 10 54.4 27.1 12.1 4.1 40 28.5 13.8 ...
## $ Parli.F : num 27.6 20.7 25.7 36.8 10.7 30.5 30.3 15.6 16.7 15 ...
dim(human)
## [1] 155 8
summary(human)
## Edu2.FM Labo.FM Edu.Exp Life.Exp
## Min. :0.1717 Min. :0.1857 Min. : 5.40 Min. :49.00
## 1st Qu.:0.7264 1st Qu.:0.5984 1st Qu.:11.25 1st Qu.:66.30
## Median :0.9375 Median :0.7535 Median :13.50 Median :74.20
## Mean :0.8529 Mean :0.7074 Mean :13.18 Mean :71.65
## 3rd Qu.:0.9968 3rd Qu.:0.8535 3rd Qu.:15.20 3rd Qu.:77.25
## Max. :1.4967 Max. :1.0380 Max. :20.20 Max. :83.50
## GNI Mat.Mor Ado.Birth Parli.F
## Min. : 581 Min. : 1.0 Min. : 0.60 Min. : 0.00
## 1st Qu.: 4198 1st Qu.: 11.5 1st Qu.: 12.65 1st Qu.:12.40
## Median : 12040 Median : 49.0 Median : 33.60 Median :19.30
## Mean : 17628 Mean : 149.1 Mean : 47.16 Mean :20.91
## 3rd Qu.: 24512 3rd Qu.: 190.0 3rd Qu.: 71.95 3rd Qu.:27.95
## Max. :123124 Max. :1100.0 Max. :204.80 Max. :57.50
Namely, the variables are
The data sorted by the GNI
set.seed(1)
s <- sample(1:nrow(human),20)
subdata <- human[s,]
sort.GNI <- subdata$GNI[order(subdata$GNI,decreasing = T)]
barplot(sort.GNI, names.arg = row.names(subdata[order(subdata$GNI,decreasing = T),]), las=2)
The scatter plot of Life.Exp and GNI
library(ggplot2)
library(ggrepel)
list <- c("Australia",
"Azerbaijan","Belgium","Canada", "Chile", "China", "Cuba", "Denmark",
"Finland", "France", "Spain", "United States", "United Kingdom", "Germany", "India", "Greece", "Ghana", "Israel", "Hungary", "Niger", "Argentina", "Mexico", "Venezuela", "Mongolia", "Morocco", "Nepal", "Namibia", "Pakistan", "Peru", "Philippines", "Romania", "Tajikistan", "Tunisia", "Senegal", "Zambia")
s <- which(rownames(human) %in% list)
p <- ggplot(human[s,], aes(Life.Exp, GNI)) +
geom_point(color = 'red') +
theme_classic(base_size = 10) + labs(y = 'Gross National Income per capita', x = 'Life expectancy at birth')
p + geom_text_repel(aes(label = rownames(human[s,])),
size = 3.5)
The scatter plot of Edu.Exp and GNI
p <- ggplot(human[s,], aes(Edu.Exp, GNI)) +
geom_point(color = 'red') +
theme_classic(base_size = 10) + labs(y = 'Gross National Income per capita', x = 'Expected years in schooling')
p + geom_text_repel(aes(label = rownames(human[s,])),
size = 3.5)
The scatter plot of Edu2.FM and GNI
p <- ggplot(human[s,], aes(Edu2.FM, GNI)) +
geom_point(color = 'red') +
theme_classic(base_size = 10) + labs(y = 'Gross National Income per capita', x = 'The ratio of Female and Male populations with secondary education')
p + geom_text_repel(aes(label = rownames(human[s,])),
size = 3.5)
The scatter plot of Labo.FM and GNI
p <- ggplot(human[s,], aes(Labo.FM, GNI)) +
geom_point(color = 'red') +
theme_classic(base_size = 10) + labs(y = 'Gross National Income per capita', x = 'The ratio of labour force participation of females and males')
p + geom_text_repel(aes(label = rownames(human[s,])),
size = 3.5)
The scatter plot of Parli.F and GNI
p <- ggplot(human[s,], aes(Parli.F, GNI)) +
geom_point(color = 'red') +
theme_classic(base_size = 10) + labs(y = 'Gross National Income per capita', x = 'Women\'s participation in parliament (percent)')
p + geom_text_repel(aes(label = rownames(human[s,])),
size = 3.5)
To summarize the correlations betweem variables let us plot the correlation matrix.
library(corrplot)
corr_matrix<-cor(human)
corrplot(corr_matrix)
Principal Component Analysis (PCA) can be performed by two sightly different matrix decomposition methods from linear algebra: the Eigenvalue Decomposition and the Singular Value Decomposition (SVD).
Both methods quite literally decompose a data matrix into a product of smaller matrices, which let’s us extract the underlying principal components. This makes it possible to approximate a lower dimensional representation of the data by choosing only a few principal components.
Let us next perform the PCA for our human data.
# perform principal component analysis (with the SVD method)
pca_human <- prcomp(human)
pca_human
## Standard deviations (1, .., p=8):
## [1] 1.854416e+04 1.855219e+02 2.518701e+01 1.145441e+01 3.766241e+00
## [6] 1.565912e+00 1.912052e-01 1.591112e-01
##
## Rotation (n x k) = (8 x 8):
## PC1 PC2 PC3 PC4 PC5
## Edu2.FM 5.607472e-06 -0.0006713951 -3.412027e-05 -2.736326e-04 0.0022935252
## Labo.FM -2.331945e-07 0.0002819357 5.302884e-04 -4.692578e-03 -0.0022190154
## Edu.Exp 9.562910e-05 -0.0075529759 1.427664e-02 -3.313505e-02 -0.1431180282
## Life.Exp 2.815823e-04 -0.0283150248 1.294971e-02 -6.752684e-02 -0.9865644425
## GNI 9.999832e-01 0.0057723054 -5.156742e-04 4.932889e-05 0.0001135863
## Mat.Mor -5.655734e-03 0.9916320120 1.260302e-01 -6.100534e-03 -0.0266373214
## Ado.Birth -1.233961e-03 0.1255502723 -9.918113e-01 5.301595e-03 -0.0188618600
## Parli.F 5.526460e-05 -0.0032317269 -7.398331e-03 -9.971232e-01 0.0716401914
## PC6 PC7 PC8
## Edu2.FM 2.180183e-02 -6.998623e-01 7.139410e-01
## Labo.FM 3.264423e-02 -7.132267e-01 -7.001533e-01
## Edu.Exp 9.882477e-01 3.826887e-02 7.776451e-03
## Life.Exp -1.453515e-01 -5.380452e-03 2.281723e-03
## GNI -2.711698e-05 8.075191e-07 -1.176762e-06
## Mat.Mor 1.695203e-03 -1.355518e-04 8.371934e-04
## Ado.Birth 1.273198e-02 8.641234e-05 -1.707885e-04
## Parli.F -2.309896e-02 2.642548e-03 2.680113e-03
summary(pca_human)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8
## Standard deviation 1.854e+04 185.5219 25.19 11.45 3.766 1.566 0.1912 0.1591
## Proportion of Variance 9.999e-01 0.0001 0.00 0.00 0.000 0.000 0.0000 0.0000
## Cumulative Proportion 9.999e-01 1.0000 1.00 1.00 1.000 1.000 1.0000 1.0000
# draw a biplot of the principal component representation and the original variables
biplot(pca_human, choices = 1:2)
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped
library(ggfortify)
pca.plot <- autoplot(pca_human, data = human, colour = 'GNI')
## Warning: `select_()` is deprecated as of dplyr 0.7.0.
## Please use `select()` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.
pca.plot
And the same for standarized data:
# perform principal component analysis (with the SVD method)
pca_human <- prcomp(scale(human))
pca_human
## Standard deviations (1, .., p=8):
## [1] 2.0708380 1.1397204 0.8750485 0.7788630 0.6619563 0.5363061 0.4589994
## [8] 0.3222406
##
## Rotation (n x k) = (8 x 8):
## PC1 PC2 PC3 PC4 PC5
## Edu2.FM 0.35664370 -0.03796058 -0.24223089 0.62678110 -0.5983585
## Labo.FM -0.05457785 -0.72432726 -0.58428770 0.06199424 0.2625067
## Edu.Exp 0.42766720 -0.13940571 -0.07340270 -0.07020294 0.1659678
## Life.Exp 0.44372240 0.02530473 0.10991305 -0.05834819 0.1628935
## GNI 0.35048295 -0.05060876 -0.20168779 -0.72727675 -0.4950306
## Mat.Mor -0.43697098 -0.14508727 -0.12522539 -0.25170614 -0.1800657
## Ado.Birth -0.41126010 -0.07708468 0.01968243 0.04986763 -0.4672068
## Parli.F 0.08438558 -0.65136866 0.72506309 0.01396293 -0.1523699
## PC6 PC7 PC8
## Edu2.FM 0.17713316 0.05773644 0.16459453
## Labo.FM -0.03500707 -0.22729927 -0.07304568
## Edu.Exp -0.38606919 0.77962966 -0.05415984
## Life.Exp -0.42242796 -0.43406432 0.62737008
## GNI 0.11120305 -0.13711838 -0.16961173
## Mat.Mor 0.17370039 0.35380306 0.72193946
## Ado.Birth -0.76056557 -0.06897064 -0.14335186
## Parli.F 0.13749772 0.00568387 -0.02306476
summary(pca_human)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 2.0708 1.1397 0.87505 0.77886 0.66196 0.53631 0.45900
## Proportion of Variance 0.5361 0.1624 0.09571 0.07583 0.05477 0.03595 0.02634
## Cumulative Proportion 0.5361 0.6984 0.79413 0.86996 0.92473 0.96069 0.98702
## PC8
## Standard deviation 0.32224
## Proportion of Variance 0.01298
## Cumulative Proportion 1.00000
# draw a biplot of the principal component representation and the original variables
biplot(pca_human, choices = 1:2)
pca.plot <- autoplot(pca_human, data = scale(human), colour = 'GNI')
pca.plot
In PCA scale is necessary as it removes the biases in the original variables. Especially in our case in which we have variables in different units. The standardized variables will be unitless and have a similar variance, which allows to divide the data in components that try to explain the variance of the data as much as possible(!). Hence, it is reasonable to interpret only the results for the standardised data.
The first principal component is able to capture 54 percent of the total variability of the data. The second component 16 percent and the third 10 percent. The lower components explain less than 10 percent in total of 30 percent. That is, three first principal components can explain 80 percent of the total variability of the whole human data.
From the biplot and loadings plot, we can see the variables Labo.FM and Parli.F are highly associated and form a cluster. On the other hand, variables Ado.Birth and Mat.Mor form a cluster and are associated as well. And the third and the greatest cluster is formed by Edu.Exp, Life.Exp,Edu2.FM, and GNI. In clusters there are a strong positive correlation between the variables. The length of PCs in biplot refers to the amount of variance contributed by the PCs. The longer the length of PC, the higher the variance contributed and well represented in space. If the variables are highly associated, the angle between the variable vectors should be as small as possible in the biplot. It seems to be so that the clusters are divided in three: women rights, child mortality, and education. The first principal component explains the variation in child mortality and education, and the second principal component the women rights.
Lastly we load the tea dataset from the package FactoMineR and explore the data briefly. The data used here concern a questionnaire on tea. They asked to 300 individuals how they drink tea (18 questions), what are their product’s perception (12 questions) and some personal details (4 questions). A data frame with 300 rows and 36 columns. Rows represent the individuals, columns represent the different questions. The first 18 questions are active ones, the 19th is a supplementary quantitative variable (the age) and the last variables are supplementary categorical variables.
library(FactoMineR)
library(dplyr)
library(tidyr)
# reduced dataset
data(tea)
tea_time <- select(tea, c("Tea", "How", "how", "sugar", "where", "lunch"))
str(tea_time)
## 'data.frame': 300 obs. of 6 variables:
## $ Tea : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
## $ How : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
## $ how : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ sugar: Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
## $ where: Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ lunch: Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
dim(tea_time)
## [1] 300 6
summary(tea_time)
## Tea How how sugar
## black : 74 alone:195 tea bag :170 No.sugar:155
## Earl Grey:193 lemon: 33 tea bag+unpackaged: 94 sugar :145
## green : 33 milk : 63 unpackaged : 36
## other: 9
## where lunch
## chain store :192 lunch : 44
## chain store+tea shop: 78 Not.lunch:256
## tea shop : 30
##
# visualize the dataset
gather(tea_time) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") + geom_bar() + theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8))
## Warning: attributes are not identical across measure variables;
## they will be dropped
# multiple correspondence analysis
mca <- MCA(tea_time, graph = FALSE)
# summary of the model
summary(mca)
##
## Call:
## MCA(X = tea_time, graph = FALSE)
##
##
## Eigenvalues
## Dim.1 Dim.2 Dim.3 Dim.4 Dim.5 Dim.6 Dim.7
## Variance 0.279 0.261 0.219 0.189 0.177 0.156 0.144
## % of var. 15.238 14.232 11.964 10.333 9.667 8.519 7.841
## Cumulative % of var. 15.238 29.471 41.435 51.768 61.434 69.953 77.794
## Dim.8 Dim.9 Dim.10 Dim.11
## Variance 0.141 0.117 0.087 0.062
## % of var. 7.705 6.392 4.724 3.385
## Cumulative % of var. 85.500 91.891 96.615 100.000
##
## Individuals (the 10 first)
## Dim.1 ctr cos2 Dim.2 ctr cos2 Dim.3
## 1 | -0.298 0.106 0.086 | -0.328 0.137 0.105 | -0.327
## 2 | -0.237 0.067 0.036 | -0.136 0.024 0.012 | -0.695
## 3 | -0.369 0.162 0.231 | -0.300 0.115 0.153 | -0.202
## 4 | -0.530 0.335 0.460 | -0.318 0.129 0.166 | 0.211
## 5 | -0.369 0.162 0.231 | -0.300 0.115 0.153 | -0.202
## 6 | -0.369 0.162 0.231 | -0.300 0.115 0.153 | -0.202
## 7 | -0.369 0.162 0.231 | -0.300 0.115 0.153 | -0.202
## 8 | -0.237 0.067 0.036 | -0.136 0.024 0.012 | -0.695
## 9 | 0.143 0.024 0.012 | 0.871 0.969 0.435 | -0.067
## 10 | 0.476 0.271 0.140 | 0.687 0.604 0.291 | -0.650
## ctr cos2
## 1 0.163 0.104 |
## 2 0.735 0.314 |
## 3 0.062 0.069 |
## 4 0.068 0.073 |
## 5 0.062 0.069 |
## 6 0.062 0.069 |
## 7 0.062 0.069 |
## 8 0.735 0.314 |
## 9 0.007 0.003 |
## 10 0.643 0.261 |
##
## Categories (the 10 first)
## Dim.1 ctr cos2 v.test Dim.2 ctr cos2
## black | 0.473 3.288 0.073 4.677 | 0.094 0.139 0.003
## Earl Grey | -0.264 2.680 0.126 -6.137 | 0.123 0.626 0.027
## green | 0.486 1.547 0.029 2.952 | -0.933 6.111 0.107
## alone | -0.018 0.012 0.001 -0.418 | -0.262 2.841 0.127
## lemon | 0.669 2.938 0.055 4.068 | 0.531 1.979 0.035
## milk | -0.337 1.420 0.030 -3.002 | 0.272 0.990 0.020
## other | 0.288 0.148 0.003 0.876 | 1.820 6.347 0.102
## tea bag | -0.608 12.499 0.483 -12.023 | -0.351 4.459 0.161
## tea bag+unpackaged | 0.350 2.289 0.056 4.088 | 1.024 20.968 0.478
## unpackaged | 1.958 27.432 0.523 12.499 | -1.015 7.898 0.141
## v.test Dim.3 ctr cos2 v.test
## black 0.929 | -1.081 21.888 0.382 -10.692 |
## Earl Grey 2.867 | 0.433 9.160 0.338 10.053 |
## green -5.669 | -0.108 0.098 0.001 -0.659 |
## alone -6.164 | -0.113 0.627 0.024 -2.655 |
## lemon 3.226 | 1.329 14.771 0.218 8.081 |
## milk 2.422 | 0.013 0.003 0.000 0.116 |
## other 5.534 | -2.524 14.526 0.197 -7.676 |
## tea bag -6.941 | -0.065 0.183 0.006 -1.287 |
## tea bag+unpackaged 11.956 | 0.019 0.009 0.000 0.226 |
## unpackaged -6.482 | 0.257 0.602 0.009 1.640 |
##
## Categorical variables (eta2)
## Dim.1 Dim.2 Dim.3
## Tea | 0.126 0.108 0.410 |
## How | 0.076 0.190 0.394 |
## how | 0.708 0.522 0.010 |
## sugar | 0.065 0.001 0.336 |
## where | 0.702 0.681 0.055 |
## lunch | 0.000 0.064 0.111 |
# visualize MCA
plot(mca, invisible=c("ind"), habillage = "quali")
To summarize the data we find the following. People mostly drink tea in tea bags and without lemon and milk. The most common time to drink tea is not lunch time. People are divided between drinking tea with and without sugar. Earl Grey is the most popular brand and it people buy their tea from chain stores.
Based on the MCA we can conclude that unpacked tea is bought from tea shops and tea bags from chain stores. There is a cluster of drinking tea in outside of lunch time without any additional ingredients (milk or lemon). Moreover, it seems to be so that there is a little correlation between drinking Earl Grey with milk. Green tea is
Today is
## [1] "Mon Nov 30 17:47:34 2020"
This week we practise the analysis of longitudinal data. We will make graphical displays and summaries. We use the data RATS given by here: The data is from a nutrition study conducted in three groups of rats (Crowder and Hand, 1990). The three groups were put on different diets, and each animal’s body weight (grams) was recorded repeatedly (approximately weekly, except in week seven when two recordings were taken) over a 9-week period.
# Access the packages dplyr and tidyr
library(dplyr)
library(tidyr)
library(ggplot2)
library(knitr)
# Read the data
RATSL <- read.csv("/Users/teemupekkarinen/IODS-project/data/RATSL",row.names = 1,stringsAsFactors = T)
str(RATSL)
## 'data.frame': 176 obs. of 5 variables:
## $ ID : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Group: int 1 1 1 1 1 1 1 1 2 2 ...
## $ Times: Factor w/ 11 levels "WD1","WD15","WD22",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ Rats : int 240 225 245 260 255 260 275 245 410 405 ...
## $ Time : int 1 1 1 1 1 1 1 1 1 1 ...
dim(RATSL)
## [1] 176 5
summary(RATSL)
## ID Group Times Rats Time
## Min. : 1.00 Min. :1.00 WD1 :16 Min. :225.0 Min. : 1.00
## 1st Qu.: 4.75 1st Qu.:1.00 WD15 :16 1st Qu.:267.0 1st Qu.:15.00
## Median : 8.50 Median :1.50 WD22 :16 Median :344.5 Median :36.00
## Mean : 8.50 Mean :1.75 WD29 :16 Mean :384.5 Mean :33.55
## 3rd Qu.:12.25 3rd Qu.:2.25 WD36 :16 3rd Qu.:511.2 3rd Qu.:50.00
## Max. :16.00 Max. :3.00 WD43 :16 Max. :628.0 Max. :64.00
## (Other):80
# Factor ID & Group
RATSL$ID <- factor(RATSL$ID)
RATSL$Group <- factor(RATSL$Group)
# Take a glimpse at the RATSL data
glimpse(RATSL)
## Rows: 176
## Columns: 5
## $ ID <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 1, 2, 3,…
## $ Group <fct> 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 1, 1, 1, 1, 1, …
## $ Times <fct> WD1, WD1, WD1, WD1, WD1, WD1, WD1, WD1, WD1, WD1, WD1, WD1, WD1…
## $ Rats <int> 240, 225, 245, 260, 255, 260, 275, 245, 410, 405, 445, 555, 470…
## $ Time <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 8, 8, 8, 8, 8, …
# Table
kable(RATSL[1:16,])
| ID | Group | Times | Rats | Time |
|---|---|---|---|---|
| 1 | 1 | WD1 | 240 | 1 |
| 2 | 1 | WD1 | 225 | 1 |
| 3 | 1 | WD1 | 245 | 1 |
| 4 | 1 | WD1 | 260 | 1 |
| 5 | 1 | WD1 | 255 | 1 |
| 6 | 1 | WD1 | 260 | 1 |
| 7 | 1 | WD1 | 275 | 1 |
| 8 | 1 | WD1 | 245 | 1 |
| 9 | 2 | WD1 | 410 | 1 |
| 10 | 2 | WD1 | 405 | 1 |
| 11 | 2 | WD1 | 445 | 1 |
| 12 | 2 | WD1 | 555 | 1 |
| 13 | 3 | WD1 | 470 | 1 |
| 14 | 3 | WD1 | 535 | 1 |
| 15 | 3 | WD1 | 520 | 1 |
| 16 | 3 | WD1 | 510 | 1 |
# Draw the plot
ggplot(RATSL, aes(x = Time, y = Rats, linetype = ID)) +
geom_line() +
scale_linetype_manual(values = rep(1:10, times=4)) +
facet_grid(. ~ Group, labeller = label_both) +
theme(legend.position = "none") +
scale_y_continuous(limits = c(min(RATSL$Rats), max(RATSL$Rats)))
That is, there are different 16 rats for which the table above gives the initial weight in day 1. The first figure shows how the weight of rats in each group evolve over the test period. Clearly there are differences in weights between groups. Moreover, in group 2 there is a clear outlier of a huge rat.
# Standardise the variable Rats
RATSL <- RATSL %>%
group_by(Time) %>%
mutate(stdRats = (Rats - mean(Rats))/sd(Rats) ) %>%
ungroup()
# Glimpse the data
glimpse(RATSL)
## Rows: 176
## Columns: 6
## $ ID <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 1, 2, …
## $ Group <fct> 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 1, 1, 1, 1, 1…
## $ Times <fct> WD1, WD1, WD1, WD1, WD1, WD1, WD1, WD1, WD1, WD1, WD1, WD1, W…
## $ Rats <int> 240, 225, 245, 260, 255, 260, 275, 245, 410, 405, 445, 555, 4…
## $ Time <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 8, 8, 8, 8, 8…
## $ stdRats <dbl> -1.0011429, -1.1203857, -0.9613953, -0.8421525, -0.8819001, -…
# Plot again with the standardised Rats
ggplot(RATSL, aes(x = Time, y = stdRats, linetype = ID)) +
geom_line() +
scale_linetype_manual(values = rep(1:10, times=4)) +
facet_grid(. ~ Group, labeller = label_both) +
scale_y_continuous(name = "standardized Rats")
# Number of time periods, baseline (time 1) included
n <- RATSL$Time %>% unique() %>% length()
# Summary data with mean and standard error of Rats by Group and Time
RATSS <- RATSL %>%
group_by(Group, Time) %>%
summarise( mean = mean(Rats), se = sd(Rats)/sqrt(n) ) %>%
ungroup()
## `summarise()` regrouping output by 'Group' (override with `.groups` argument)
# Glimpse the data
glimpse(RATSS)
## Rows: 33
## Columns: 4
## $ Group <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, …
## $ Time <int> 1, 8, 15, 22, 29, 36, 43, 44, 50, 57, 64, 1, 8, 15, 22, 29, 36,…
## $ mean <dbl> 250.625, 255.000, 254.375, 261.875, 264.625, 265.000, 267.375, …
## $ se <dbl> 4.589478, 3.947710, 3.460116, 4.100800, 3.333956, 3.552939, 3.3…
# Plot the mean profiles
ggplot(RATSS, aes(x = Time, y = mean, linetype=Group, shape = Group)) +
geom_line() +
scale_linetype_manual(values = c(1,2,3)) +
geom_point(size=4) +
scale_shape_manual(values = c(1,2,3)) +
geom_errorbar(aes(ymin = mean - se, ymax = mean + se, linetype="1"), width=0.3) +
theme(legend.position = c(0.8,0.5)) +
scale_y_continuous(name = "mean(Rats) +/- se(Rats)")
# Create a summary data by Group and ID with mean as the summary variable (ignoring baseline time period 1).
RATSL8S <- RATSL %>%
filter(Time > 1) %>%
group_by(Group, ID) %>%
summarise( mean=mean(Rats) ) %>%
ungroup()
## `summarise()` regrouping output by 'Group' (override with `.groups` argument)
# Glimpse the data
glimpse(RATSL8S)
## Rows: 16
## Columns: 3
## $ Group <fct> 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3
## $ ID <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16
## $ mean <dbl> 263.2, 238.9, 261.7, 267.2, 270.9, 276.2, 274.6, 267.5, 443.9, …
# Draw a boxplot of the mean versus treatment
ggplot(RATSL8S, aes(x = Group, y = mean)) +
geom_boxplot() +
stat_summary(fun = "mean", geom = "point", shape=23, size=4, fill = "white") +
scale_y_continuous(name = "mean(Rats), time periods 2-64")
# Create a new data by filtering the outlier and adjust the ggplot code the draw the plot again with the new data
RATSL8S1 <- RATSL8S %>%
filter(mean < 550)
# Glimpse the data
glimpse(RATSL8S1)
## Rows: 15
## Columns: 3
## $ Group <fct> 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3
## $ ID <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 13, 14, 15, 16
## $ mean <dbl> 263.2, 238.9, 261.7, 267.2, 270.9, 276.2, 274.6, 267.5, 443.9, …
# Draw a boxplot of the mean versus treatment
ggplot(RATSL8S1, aes(x = Group, y = mean)) +
geom_boxplot() +
stat_summary(fun = "mean", geom = "point", shape=23, size=4, fill = "white") +
scale_y_continuous(name = "mean(Rats), time periods 2-64")